Back-off language model compression
Authors
Abstract
With the availability of large amounts of training data relevant to speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present a technique that represents a back-off n-gram language model using arrays of integer values and thus renders it amenable to effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along two dimensions: memory footprint, and speed reduction relative to the uncompressed model. We experimented with a model that uses a 32-bit word vocabulary (at most 4B words), with log-probabilities and back-off weights each quantized to 1 byte. The best compression algorithm achieves 2.6 bytes/n-gram at ≈18X slower than uncompressed. For faster LM operation we found it feasible to represent the LM at ≈4.0 bytes/n-gram and ≈3X slower than the uncompressed LM. The memory footprint of an LM containing one billion n-grams can thus be reduced to 3-4 GB without impacting its speed too much.
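As a rough illustration of the idea in the abstract, the sketch below block-compresses one of the sorted integer arrays of the model (e.g., the word-id array at a given n-gram order) with delta plus variable-byte coding, so that a lookup only has to decompress a single block. The block size, the coder, and all names are illustrative assumptions, not the paper's actual scheme.

```python
# Block compression sketch for a sorted array of 32-bit word ids: within each
# block the ids are delta-encoded and packed with a variable-byte code, and an
# uncompressed per-block anchor allows random access by decoding one block only.
import bisect

BLOCK_SIZE = 64  # entries per block; a knob trading lookup speed for size


def vbyte_encode(value: int, out: bytearray) -> None:
    """Append a variable-byte encoding (7 payload bits per byte, MSB = continue)."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def vbyte_decode(buf: bytes, pos: int):
    """Decode one variable-byte integer starting at pos; return (value, new_pos)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b < 0x80:
            return value, pos
        shift += 7


def compress_blocks(sorted_ids):
    """Split a sorted id array into blocks of delta + variable-byte coded gaps."""
    first_ids, blocks = [], []
    for start in range(0, len(sorted_ids), BLOCK_SIZE):
        block = sorted_ids[start:start + BLOCK_SIZE]
        first_ids.append(block[0])          # uncompressed anchor for binary search
        packed, prev = bytearray(), block[0]
        for x in block[1:]:
            vbyte_encode(x - prev, packed)  # gaps between sorted ids stay small
            prev = x
        blocks.append(bytes(packed))
    return first_ids, blocks


def contains(word_id, first_ids, blocks):
    """Membership test: binary-search the block anchors, then decode one block."""
    b = bisect.bisect_right(first_ids, word_id) - 1
    if b < 0:
        return False
    current, pos, buf = first_ids[b], 0, blocks[b]
    while current < word_id and pos < len(buf):
        gap, pos = vbyte_decode(buf, pos)
        current += gap
    return current == word_id
```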
Similar resources
Quantization-based language model compression
This paper describes two techniques for reducing the size of statistical back-off n-gram language models in computer memory. Language model compression is achieved through a combination of quantizing language model probabilities and back-off weights and pruning parameters that are determined to be unnecessary after quantization. The recognition performance of the original and compressed l...
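To illustrate the quantization step mentioned in this snippet, here is a minimal sketch assuming plain uniform scalar quantization of log-probabilities (and, identically, back-off weights) into 256 levels so each value occupies one byte; the paper's quantizer and its post-quantization pruning may differ, and numpy plus the names here are assumptions.

```python
import numpy as np


def build_codebook(values: np.ndarray, levels: int = 256) -> np.ndarray:
    """Uniformly spaced codebook spanning the observed value range."""
    return np.linspace(values.min(), values.max(), levels)


def quantize(values: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each float to the uint8 index of its nearest codebook entry."""
    idx = np.clip(np.searchsorted(codebook, values), 1, len(codebook) - 1)
    nearer_left = (values - codebook[idx - 1]) < (codebook[idx] - values)
    return (idx - nearer_left).astype(np.uint8)


def dequantize(codes: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Recover approximate values (one table lookup per 1-byte code)."""
    return codebook[codes]
```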
Language model adaptation using minimum discrimination information
In this paper, adaptation of language models using the minimum discrimination information criterion is presented. Language model probabilities are adapted based on unigram, bigram and trigram features using a modified version of the generalized iterative scaling algorithm. Furthermore, a language model compression algorithm, based on conditional relative entropy, is discussed. It removes probabil...
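The sketch below shows only the simplest, unigram-feature case of MDI adaptation in closed form: background conditionals are scaled by the ratio of adaptation-domain to background unigram probabilities raised to an exponent and renormalized per history. The paper itself uses unigram, bigram and trigram features fitted with generalized iterative scaling, so treat this as an illustrative approximation; the names and exponent value are assumptions.

```python
def mdi_unigram_adapt(p_background, p_adapt_uni, p_bg_uni, beta=0.5):
    """p_background: {history: {word: prob}}; returns a model of the same shape."""
    adapted = {}
    for history, dist in p_background.items():
        # Scale each conditional probability by the unigram ratio raised to beta.
        scaled = {
            w: p * (p_adapt_uni.get(w, 1e-12) / p_bg_uni.get(w, 1e-12)) ** beta
            for w, p in dist.items()
        }
        z = sum(scaled.values())  # renormalize so each history sums to one
        adapted[history] = {w: s / z for w, s in scaled.items()}
    return adapted
```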
N-Gram Language Model Compression Using Scalar Quantization and Incremental Coding
This paper describes a novel approach to compressing large trigram language models, which uses scalar quantization to compress log probabilities and back-off coefficients, and incremental coding to compress entry pointers. Experiments show that the new approach achieves roughly 2.5 times the compression ratio of the well-known tree-bucket format while keeping the perplexity and accessing ...
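As an illustration of incremental coding for a monotone entry-pointer array (e.g., for each bigram, the index of its first trigram), the sketch below keeps each block's first pointer uncompressed and stores the remaining pointers as offsets from it, recording the smallest bit width that fits; the block size and names are assumptions rather than the paper's actual format.

```python
BLOCK = 128  # pointers per block; an assumed tuning knob


def pack_pointers(pointers):
    """Split a non-decreasing pointer array into (base, bit_width, offsets) blocks."""
    blocks = []
    for start in range(0, len(pointers), BLOCK):
        chunk = pointers[start:start + BLOCK]
        base = chunk[0]
        offsets = [p - base for p in chunk]
        width = max(1, max(offsets).bit_length())  # bits per offset on disk
        # A real implementation would bit-pack the offsets at `width` bits each;
        # they are kept as a plain list here for clarity.
        blocks.append((base, width, offsets))
    return blocks


def get_pointer(i, blocks):
    """Random access: pick the block by index, then add the stored offset."""
    base, _width, offsets = blocks[i // BLOCK]
    return base + offsets[i % BLOCK]
```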
Experience with a Stack Decoder-Based HMM CSR and Back-Off N-Gram Language Models
Stochastic language models are more useful than nonstochastic models because they contribute more information than a simple acceptance or rejection of a word sequence. Back-off N-gram language models [11] are an effective class of word-based stochastic language model. The first part of this paper describes our experiences using the back-off language models in our time-synchronous decoder CSR...
On enhancing Katz-smoothing based back-off language model
Though statistical language modeling plays an important role in speech recognition, there are still many problems that are difficult to solve, such as the sparseness of training data. Generally, two kinds of smoothing approaches, namely the back-off model and the interpolated model, have been proposed to solve the problem of the impreciseness of language models caused by the sparseness o...
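For readers unfamiliar with the back-off scheme discussed in these snippets, the sketch below shows the generic Katz-style back-off recursion: use the explicit (discounted) probability when the full n-gram is stored, otherwise fall back to the shorter context scaled by the history's back-off weight. The dictionaries, the floor value and the names are illustrative.

```python
def backoff_prob(words, probs, bows):
    """words: a tuple such as (w1, w2, w3); probs and bows are keyed by tuples."""
    if words in probs:
        return probs[words]            # explicit, already discounted estimate
    if len(words) == 1:
        return 1e-10                   # unseen unigram: tiny probability floor
    history, shorter = words[:-1], words[1:]
    return bows.get(history, 1.0) * backoff_prob(shorter, probs, bows)
```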
Publication date: 2009